METHOD FOR GENERATING FIVE PRIME BIASED TANDEM TAG
LIBRARIES OF cDNAs



BACKGROUND OF THE INVENTION 

1. FIELD OF THE INVENTION 

The sequences of whole genomes from several organisms 
have now been elucidated and are available as searchable 
databases. This enables rapid identification of full- 
length messenger RNAs (mRNAs) expressed in a biological 
sample once a partial sequence is known. The method 
described here allows generation of such partial sequences 
consisting of a minimal length of expressed cDNA sequences 
of at least 20 bases from biological samples to rapidly 
identify novel expressed transcripts. 

2. DESCRIPTION OF THE RELATED ART 

In order to obtain a comprehensive collection of all
human genes that are expressed, many millions of cDNA
molecules must be sequenced, which is quite costly and
laborious. Since the human genome sequence became
available, much of the coding sequence of a gene can now be
inferred once a short physical sequence is obtained.
Hence, sequencing only a short stretch of each cDNA should be
sufficient in theory to identify all genes expressed in a
biological sample. The Expressed Sequence Tag (EST) method
purports to achieve this by generating, for sequencing,
relatively short cDNA fragments from 3' ends. However, the
EST method still utilizes one cDNA per clone, which means
one sequencing reaction yields one cDNA sequence.

An effective way to improve this yield so that each
plasmid and each sequencing reaction yields many cDNA
sequences is to "glue" together short cDNA fragments
end to end. The Serial Analysis of Gene Expression (SAGE)
method effectively utilizes such a concatenation procedure.
The SAGE method, however, has two key shortcomings. One is
that all of the tags are generated from a defined 3' end of
a cDNA. Mammalian genes contain long untranslated
sequences at their 3' ends, which make the determination of
coding sequence by gene prediction algorithms difficult and
unreliable. The second limitation is that the SAGE tags
are typically only 14 bases long, which is too short to
yield uniquely matching sequences from the genomic
database. A minimum of 20 bases is needed to identify a
uniquely matching gene from a mammalian genomic database
about 80% of the time.

Thus, the most important prerequisite for obtaining
expressed sequence tags to rapidly and uniquely identify
coding sequences from a messenger RNA pool is to obtain
expressed sequence tags of 20 bases or longer from the 5'
end of a coding region. Such tags can then be used as a
forward PCR primer to easily amplify, sequence, and clone
each gene uniquely. There is presently no method that
predictably generates 5' cDNA fragments of 20-40 bases.
The method described here generates one or more short tags
at or near the 5' end of each gene transcript, in tandem or
in a cluster, so that when they are aligned against genomic
sequences they together uniquely identify a contiguous
expressed sequence of 20 bases or greater.

SUMMARY OF THE INVENTION 
The present application discloses a method for
generating five prime biased tandem tag libraries of cDNAs.
The method comprises the steps of isolating a sample of
mRNAs; synthesizing double-stranded cDNAs from the mRNAs;
blunt-ending the double-stranded cDNAs; attaching an
adapter molecule to the blunt ends of the double-stranded
cDNAs to form a complex, where the adapter molecule is a
double-stranded, synthetic oligonucleotide comprising a
recognition site for a type IIS restriction enzyme, a
cloning site for releasing tags to a cloning vector, and
a PCR primer site; digesting the complex with a type IIS
restriction enzyme to form released tags; separating the
released tags from the double-stranded cDNAs; amplifying
the released tags to form amplified tags; isolating the
amplified tags; concatenating the amplified tags to form
concatenated tags; amplifying the concatenated tags; and
isolating the concatenated tags.

In a preferred embodiment, the type IIS restriction
enzyme is selected from the group consisting of Ear I, Sap
I, Alw I, Bmr I, Bsa I, BsmA I, BsmB I, Mly I, Ple I, Bbs
I, BciV I, Fau I, Mnl I, Aar I, BfuA I, BspM I, Hph I, Mbo
II, SspD5 I, Sth132 I, SfaN I, BseR I, BspCN I, Hga I,
AceIII, Eci I, TaqII, Tth111II, Bbv I, RleAI, BcefI, Fok I,
BceA I, BsmF I, StsI, Bce83I, Bpm I, Bsg I, Eco57I, Eco57MI,
and MmeI. In a more preferred embodiment, the type IIS
restriction enzyme is Bpm I.

In another preferred embodiment, the mRNAs are from a 
mammal. In a more preferred embodiment, the mRNAs are from 
a human. 

In other preferred embodiments, the released tags are
comprised of 50 nucleotides or less; the released tags are
comprised of 36 nucleotides or less; the released tags are
comprised of 32 nucleotides or less. In a more preferred
embodiment, the released tags are comprised of at least 20
nucleotides.

In yet another preferred embodiment, the method
further comprises sequencing the isolated concatenated tags
to obtain a nucleotide sequence and comparing the
nucleotide sequence to a known nucleotide sequence.

The present application also discloses a method for
generating five prime biased tandem tag libraries of cDNAs,
comprising the steps of isolating a sample of mRNAs;
synthesizing double-stranded cDNAs from the mRNAs;
blunt-ending the double-stranded cDNAs; attaching a first adapter
molecule to the blunt ends of the double-stranded cDNAs to
form a first complex, where the first adapter molecule is a
double-stranded, synthetic oligonucleotide comprising a
recognition site for a type IIS restriction enzyme, a
cloning site for releasing tags to a cloning vector, and a
PCR primer site; digesting the first complex with a type
IIS restriction enzyme to form first released tags;
separating the first released tags from the double-stranded
cDNAs and attaching a second adapter molecule to the
double-stranded cDNAs to form a second complex; amplifying
the first released tags to form first amplified tags;
isolating the first amplified tags; concatenating the first
amplified tags to form first concatenated tags; amplifying
the first concatenated tags; isolating the first
concatenated tags; digesting the second complex with a type
IIS restriction enzyme to form second released tags;
separating the second released tags from the
double-stranded cDNAs; amplifying the second released tags to form
second amplified tags; isolating the second amplified tags;
concatenating the second amplified tags to form second
concatenated tags; amplifying the second concatenated tags;
and isolating the second concatenated tags.

In a preferred embodiment, the type IIS restriction
enzyme is selected from the group consisting of Ear I, Sap
I, Alw I, Bmr I, Bsa I, BsmA I, BsmB I, Mly I, Ple I, Bbs
I, BciV I, Fau I, Mnl I, Aar I, BfuA I, BspM I, Hph I, Mbo
II, SspD5 I, Sth132 I, SfaN I, BseR I, BspCN I, Hga I,
AceIII, Eci I, TaqII, Tth111II, Bbv I, RleAI, BcefI, Fok I,
BceA I, BsmF I, StsI, Bce83I, Bpm I, Bsg I, Eco57I, Eco57MI,
and MmeI. In a more preferred embodiment, the type IIS
restriction enzyme is Bpm I.

In another preferred embodiment, the mRNAs are from a
mammal. In a more preferred embodiment, the mRNAs are from
a human.

In other preferred embodiments, the released tags are
comprised of 50 nucleotides or less; the released tags are
comprised of 36 nucleotides or less; the released tags are
comprised of 32 nucleotides or less. In a more preferred
embodiment, the released tags are comprised of at least 20
nucleotides.

In yet another preferred embodiment, the method
further comprises sequencing the isolated concatenated tags
to obtain a nucleotide sequence and comparing the
nucleotide sequence to a known nucleotide sequence.

BRIEF DESCRIPTION OF THE DRAWINGS
Figures 1A, 1B and 1C show a flow chart of an
embodiment of the present method for generating five prime
biased tandem tag libraries of cDNAs.

DETAILED DESCRIPTION 
A. Brief Description of the Method 

1. The first and second strand cDNA synthesis is
carried out according to the standard procedure. In a
preferred embodiment, the first strand synthesis is carried
out with an oligo-dT 3' primer covalently linked to magnetic
beads according to the manufacturer's protocol (Dynal
Inc.).

2. The 5' ends of the ds-cDNAs are made flush (blunt-ended)
using T4 DNA polymerase in the presence of dNTPs, followed by the
ligation of a double stranded adaptor. The adaptor can be
of any sequence but contains the recognition sequence for a
type IIS restriction enzyme that cleaves double stranded
DNA substrates at some length downstream of the recognition
site. In a preferred embodiment, the recognition sequence
for a type IIS enzyme, Bpm I (also known as Gsu I), was
placed at the 3' end of the adaptor so that the nucleotide
sequence immediately following the Bpm I site is from
cDNAs. In addition, optionally, the recognition site for a
rare six cutter such as the Mlu I enzyme can also be
incorporated into the adaptor just upstream of the Bpm I
site to be utilized at a later step. The remaining adaptor
sequence serves as the forward primer site for a subsequent
PCR amplification step.

3. The ligated adaptor-cDNAs are purified and then
digested with Bpm I to release the 16 bp cDNA tags plus the
adaptor. The rest of the cDNAs remain bound to the
magnetic beads and are saved.

4. The adaptor-tag fragments are recovered by
separating away the magnetic beads. They are ligated with
a second adaptor of an arbitrary sequence but containing
the same Mlu I site at the 5' end of the adaptor. These
two adaptors also facilitate PCR amplification of the
internal 16 bp cDNA tags.

5. PCR amplification is carried out according to the
standard procedure using the forward and reverse primers,
which contain the sequences of the two adaptors
respectively. The product is purified and ligated to a PCR
cloning vector, followed by the transformation of
competent bacteria.

6. Plasmid-harboring colonies are drug-selected. The
plasmid DNA is purified and digested with Mlu I. The
released tags plus the restriction sites (28 bp) are
isolated and ligated to form concatemers. The concatemers
of appropriate size, typically 0.5 Kb - 1.5 Kb, are
fractionated by agarose gel electrophoresis and then
ligated into an Mlu I cut vector. After cloning, the 16 bp
cDNA tags are elucidated by sequencing the concatemers.

7. The remaining cDNAs bound to the magnetic beads
from step 3 are then processed again through steps 2 -
6 to generate the second 16 bp tag from each cDNA. Thus,
after the two rounds, two tandem tags from the 5' end of
each expressed transcript are generated, which, when
aligned against the genomic sequence, generate 32 bases of
combined sequence.

8. Steps 2-6 can be repeated several times as 
necessary. 
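
The repeated release of tags in steps 2-8 can be pictured with a minimal
sketch. It is an illustration only, under stated assumptions: the cDNA is
treated as a plain string read from its 5' end, each round removes exactly
tag_len bases, and the two-nucleotide overhang left by the enzyme is
ignored. The function name release_tags and the five-base extension CCAGC
appended to the example sequence are hypothetical.

```python
def release_tags(cdna: str, tag_len: int = 16, rounds: int = 2):
    """Cut successive tags of tag_len bases from the 5' end of a cDNA string.

    Returns the list of tags (in the order released) and the remaining cDNA.
    """
    tags = []
    remainder = cdna
    for _ in range(rounds):
        if len(remainder) < tag_len:
            break  # not enough sequence left for another full tag
        tags.append(remainder[:tag_len])
        remainder = remainder[tag_len:]
    return tags, remainder


# Toy cDNA: a 32-mer followed by a made-up 5-base extension.
cdna = "GCACTTTGGGAGGCCGGCTCACGCCTGTAATC" + "CCAGC"
tags, rest = release_tags(cdna)
tandem = "".join(tags)  # the two 16-mers read together as one 32-mer
```

Raising rounds models step 8, in which steps 2-6 are repeated further.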

B. More Detailed Description of the Method 
Step 1 : cDNA synthesis 

Total RNA was isolated from the HK 532 Cortical Cell
Line using the Qiagen total RNA isolation kit (Qiagen,
Inc., Valencia, CA). Briefly, the cells were lysed in a
lysis buffer followed by binding of the RNA to the Qiagen
solid matrix, from which the RNA was eluted, precipitated
and kept at -20 °C overnight.

Messenger RNA (mRNA), typically 200 ng, was
incubated with Dynal beads (Dynal, Inc., Lake Success, NY)
containing oligo(dT) to attach the polyadenylated RNA, which
was then converted into cDNA using the Superscript II cDNA
synthesis kit (GIBCO Life Technologies, Gaithersburg, MD)
according to the manufacturer's directions.

Step 2 : Adaptor ligation 

After the second strand synthesis, the 5' ends of the
double stranded-cDNA (ds-cDNA) were blunt-ended using T4 DNA
polymerase. Oligonucleotide adaptors were created by
mixing equimolar amounts of each of two synthetic
oligonucleotides

sense strand:

GCAGTGGTATCAACGCAGAGTCCAGTGTGGTGGACGCGTCTGGAG (SEQ ID NO: 1)

antisense strand:

pCTCCAGACGCGTCCACCACACTGGACTCTGCGTTGATACCAC (SEQ ID NO: 2)

in deionized water, heating them to 95 °C, and allowing
them to cool slowly to room temperature to form:

   PCR primer site                  MluI  BpmI
5' GCAGTGGTATCAACGCAGAGTCCAGTGTGGTGGACGCGTCTGGAG
      ||||||||||||||||||||||||||||||||||||||||||
      CACCATAGTTGCGTCTCAGGTCACACCACCTGCGCAGACCTCp

(SEQ ID NO: 3)

Adaptor DNA (500 pmoles) was added to the solid-phase
cDNA in a total volume of 50 µl of 1x ligase buffer
containing 25 U of T4 ligase (Gibco BRL). The reaction was
incubated overnight at 16 °C followed by 10 min at 65 °C to
inactivate the enzyme.
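
The adaptor design can be checked mechanically. The sketch below uses
only the two oligonucleotide sequences given above (SEQ ID NO: 1 and 2,
with the 5' phosphate omitted) and the recognition sequences ACGCGT
(Mlu I) and CTGGAG (Bpm I); the helper reverse_complement and the
variable names are illustrative, not part of the described method.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


SENSE = "GCAGTGGTATCAACGCAGAGTCCAGTGTGGTGGACGCGTCTGGAG"   # SEQ ID NO: 1
ANTISENSE = "CTCCAGACGCGTCCACCACACTGGACTCTGCGTTGATACCAC"  # SEQ ID NO: 2

MLU_I = "ACGCGT"
BPM_I = "CTGGAG"

# The antisense strand should pair with the 3' portion of the sense strand.
duplex_ok = SENSE.endswith(reverse_complement(ANTISENSE))
# Unpaired bases left at the 5' end of the sense strand.
overhang = len(SENSE) - len(ANTISENSE)
# 0-based positions of the two recognition sites on the sense strand.
mlu_pos = SENSE.find(MLU_I)
bpm_pos = SENSE.find(BPM_I)
# True when the Bpm I site is the last six bases, i.e. directly abuts the cDNA.
bpm_at_3prime_end = SENSE.endswith(BPM_I)
```

With these sequences the Mlu I site lies immediately upstream of the
Bpm I site, and the Bpm I site ends the adaptor, so the bases following
it come from the ligated cDNA.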

Step 3 : Release and recovery of the first tag
Beads were again washed extensively in wash buffer (5
mM Tris-HCl, pH 8.0, 0.5 mM EDTA, 1 M NaCl and 200 µg
BSA/ml), followed by three washes in Bpm I buffer, and
resuspended in 50 µl of Bpm I buffer containing 50 U of Bpm
I and incubated at 37 °C for 5 h with gentle rotation. The
tag-containing supernatant was collected and the beads were
washed twice with 100 µl of reaction buffer 3 (NEBL,
Beverly, MA). The supernatant and washes were combined.
The combined material was extracted with phenol:CIA. A
half volume of 7.5 M ammonium acetate, or a one-third
volume of 10 M ammonium acetate, was added and DNA was
precipitated with 2 volumes of ethanol in the presence of 4
µl of glycogen (20 mg/ml) per 300 µl of initial volume.

Step 4 : Ligation of the 3' adaptor
A second, 16-fold degenerate adaptor molecule was
prepared by annealing synthetic oligos as described above

sense strand:

pACGCGTGTCGACCTCGAGT (SEQ ID NO: 4);

antisense strand:

TCTAGACTCGAGGTCGACACGCGTNN (SEQ ID NO: 5)

to give the following oligodimer:

     Mlu I   PCR primer site
     pACGCGTGTCGACCTCGAGT
      |||||||||||||||||||
    NNTGCGCACAGCTGGAGCTCAGATCT (SEQ ID NO: 6)

Five hundred pmol of adaptor were added to the tag DNA
in a total volume of 50 µl of 1x ligase buffer containing
10 U of T4 DNA ligase and incubated overnight at 16 °C. The
ligase was inactivated by incubation at 65 °C for 10 min.

Step 5 : PCR amplification of the tags

PCR amplification of the tags was carried out using
sense and antisense primers designed to match the two
adaptor sequences.

The following primers were used:

forward

5' TGTAGACTCGAGGTGGACACGC (SEQ ID NO: 7)

and reverse

5' GGAGTGGTATGAACGCAGAGTCG (SEQ ID NO: 8)

Step 6 : Tag concatenation
The PCR product was electrophoresed on a
polyacrylamide gel to isolate the 85 bp tag band. After
phenol:CIA extraction and ethanol precipitation, the DNA
was suspended in TE (pH 7.5) and ligated with a TA
cloning vector (Invitrogen, Inc., Carlsbad, CA).
Transformation was carried out according to the protocol
provided by the manufacturer.

Transformed E. coli cells were grown in 100 ml of
ampicillin-containing Terrific Broth at 37 °C, shaken at 300
rpm for 16 hr. Plasmid DNA preparation was carried out
using a Maxi kit (Qiagen Inc). About 750 µg DNA was obtained,
which was suspended in 500 µl of water.

The digestion of the purified plasmid DNA was carried
out in a volume of 750 µl using 2 Units of Mlu I per µg
of plasmid DNA for 4 hours. The resulting 28 bp tags were
purified by electrophoresis on a 1.0% agarose gel in TAE
buffer.

The 28 bp band was cut out of the gel and eluted
using a freeze-thaw technique. The DNA was extracted with
phenol:CIA and ethanol precipitated in the presence of 4 µl
glycogen and 100 µl of 10 M ammonium acetate per every 300
µl of sample. DNA was then resuspended in 16 µl water.

Concatemers were formed in a final volume of 20 µl
using 1 µl of T4 DNA ligase (NEB, 400 units/µl).
Concatemers were fractionated on an agarose gel, isolating
fragments greater than 500 bp. The fragments were purified
using the Qiaex (Qiagen, Valencia, CA) protocol following
the manufacturer's instructions. The large molecular
weight concatemers were then ligated into Mlu I-digested,
alkaline phosphatase-treated pBluescript plasmid in which
an Mlu I site had been engineered.

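Recovering the 16-base tags from a sequenced concatemer can be sketched
as a pattern search. The unit layout used here is an assumption inferred
from the adaptor sequences above, not a layout stated in this
description: each tag is taken to be flanked by an Mlu I site plus the
Bpm I site (ACGCGTCTGGAG) on one side and the next Mlu I site (ACGCGT)
on the other, and because Mlu I ends are palindromic a unit may be
ligated in either orientation, so both strands of the read are scanned.
The names UNIT and extract_tags are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


# Assumed unit: Mlu I site + Bpm I site + 16-nt tag + following Mlu I site.
# The lookahead lets consecutive units share their Mlu I site.
UNIT = re.compile(r"(?=ACGCGTCTGGAG([ACGT]{16})ACGCGT)")


def extract_tags(read: str) -> list:
    """Collect tags found on the read and then on its reverse complement."""
    tags = [m.group(1) for m in UNIT.finditer(read)]
    tags += [m.group(1) for m in UNIT.finditer(reverse_complement(read))]
    return tags


# Toy read: one unit in forward orientation, one in reverse orientation.
t1 = "GCACTTTGGGAGGCCG"
t2 = "GCTCACGCCTGTAATC"
read = "ACGCGTCTGGAG" + t1 + "ACGCGT" + reverse_complement(t2) + "CTCCAGACGCGT"
found = extract_tags(read)
```

A tag that itself contains ACGCGT would be cut by Mlu I before
concatenation and is not handled by this sketch.
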
Results 

The accuracy with which one can align a short cDNA 
sequence to the genomic sequence depends upon the length of 
the cDNA sequence. This is illustrated in TABLE 1 below. 
Using the NCBI Database of 47,584 known and hypothetical 
mRNAs, short expressed sequences (tags) from the 5' 
end of mRNAs were extracted and aligned against the genomic 
database. The result clearly demonstrates that at least 20 
bases and preferably 32 bases or more of a contiguous 
sequence of mRNA are required to obtain a unique genomic 
match and thereby to identify a coding region from a 
genomic database. 

TABLE 1: Effect of Tag Length on Unique Genomic Hits

TAG LENGTH    % TAGS WITH UNIQUE GENOMIC HIT
14            5.76
16            37.56
18            74.47
20            84.56
32            89.44
36            90.07
40            90.61
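
The kind of measurement summarized in TABLE 1 can be sketched on toy
data. This is not the analysis that produced the table: the "genome"
below is a random string with a repeated element spliced in, only one
strand is searched, and the function name unique_fraction, the seed and
all sizes are arbitrary choices made for illustration.

```python
import random
from collections import Counter


def unique_fraction(genome: str, starts: list, k: int) -> float:
    """Fraction of k-mer tags taken at `starts` that occur exactly once in genome."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    tags = [genome[s:s + k] for s in starts]
    return sum(counts[t] == 1 for t in tags) / len(tags)


rng = random.Random(0)
# A 300-base element that is reused many times, mimicking repetitive DNA.
repeat = "".join(rng.choice("ACGT") for _ in range(300))
genome = "".join(
    repeat if rng.random() < 0.3
    else "".join(rng.choice("ACGT") for _ in range(300))
    for _ in range(60)
)
# Tag start positions, spaced so every tag length below fits in the genome.
starts = list(range(0, len(genome) - 40, 37))
fractions = {k: unique_fraction(genome, starts, k) for k in (6, 8, 10, 12, 16)}
```

Because a tag that is already unique stays unique when it is extended,
the fraction can only stay level or rise with tag length, which is the
trend the table shows for real data.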



However, there is currently no enzyme that can
reproducibly generate fragments of 20 bases or longer from
double stranded cDNAs. We have developed a method to
generate such expressed fragments. By obtaining one or
more successive shorter fragments (tags) of 10-20 bases,
which can then be aligned against the genomic sequence, the
method generates two tandem tags which, in effect, produce
a long contiguous sequence of 20 bases or greater. As a
preferred embodiment, we have used an enzyme, Bpm I, which
generates 16 base pair tags each time and 32 base pair
tandem tags when aligned. A schematic outline of the
method is shown in FIGURES 1A, 1B and 1C.

As an example, a tandem tag library, i.e., two
successive tag libraries from a single cDNA sample, was
generated from the mRNA of a human cortical neural stem
cell culture consisting of approximately 2 x lO"^ cells. The
resulting tag libraries were sequenced and aligned against the
human genomic database, and pairs of tags that align
perfectly end to end on the genomic sequence were
identified as tandem tags. Some of the tandem tags are
shown in TABLE 2 and TABLE 3.

In TABLE 2, the two tandem 16-mer tags which uniquely
and perfectly match known mRNA sequences are shown. The
NCBI database of 47,584 known and hypothetical mRNAs was
used as the template. In TABLE 3, the human genomic
database was used first as the template to generate tandem
tags. These were then compared to the mRNA database to
verify whether the tandem tags indeed identified a coding
region. These tandem tags are also found to be tandem
within a known mRNA. BLAST of the mRNA sequence to the human
genome reveals that the tandem genomic alignment was correct in
each case.

To further test the efficiency of the tandem tags to
identify coding regions within the human genome, 400 random
16-mers from the first tag library and 400 random 16-mers
from the second tag library were selected. Tandem tags
were identified from the genomic database. As shown in
TABLE 4, the 32-mer tandem tags were vastly more efficient
in zeroing in on the uniquely matching coding region of the
human genome than the individual 16-mer tags.

TABLE 4: Tandem vs. Non-tandem Efficiency

TAGS                                                 GENOME MATCHES

GCACTTTGGGAGGCCGGCTCACGCCTGTAATC (SEQ ID NO: 41)     1
GCACTTTGGGAGGCCG (SEQ ID NO: 42)                     157,201
GCTCACGCCTGTAATC (SEQ ID NO: 43)                     170,672

CACGCCCGTAATCCCAAGCACTTTGGGAGGCT (SEQ ID NO: 44)     1
CACGCCCGTAATCCCA (SEQ ID NO: 45)                     1,337
AGCACTTTGGGAGGCT (SEQ ID NO: 46)                     132,561

AGCACTTTGGGAGGCTGAGATCGAGACCATCC (SEQ ID NO: 47)     2
AGCACTTTGGGAGGCT (SEQ ID NO: 48)                     132,561
GAGATCGAGACCATCC (SEQ ID NO: 49)                     66,177

GCTTGAACCTGGGAGGGGAGGTTGCAGTGAGC (SEQ ID NO: 50)     10
GCTTGAACCTGGGAGG (SEQ ID NO: 51)                     62,182
GGAGGTTGCAGTGAGC (SEQ ID NO: 52)                     162,173

GGCCAACATGGCGAAACCCGTCTCTACTAAAA (SEQ ID NO: 53)     47
GGCCAACATGGCGAAA (SEQ ID NO: 54)                     17,111
CCCGTCTCTACTAAAA (SEQ ID NO: 55)                     138,143

GTGGAGCTTGCAGTGAGCCGAGATCGCGCCAC (SEQ ID NO: 56)     1180
GTGGAGCTTGCAGTGA (SEQ ID NO: 57)                     14,992
GCCGAGATCGCGCCAC (SEQ ID NO: 58)                     20,598

The key notion that two 16-mer tags can be aligned
against the genomic database to identify a unique 32-mer
coding sequence was further tested in silico in the
following analysis. Using the set of 13,904 unique RefSeq
known mRNAs, two consecutive 16-mer tags were extracted
near the 5' end of 1,000 mRNAs. These 16-mer tags were
then pooled into a single "bin" to mimic a tag library. We
then asked whether we could successfully recover, first,
the tandem tags, and, second, the correct coding region by
aligning the individual 16-mer tags against the human
genome database. The 32 bp result set of tandem genome
alignments was compared to the original 1,000 32 bp known
mRNA tandems. The results are summarized in TABLE 5 below.

Approximately 75% of the 32-mer sequences could be 
recovered by the tandem method. The remaining 25% not 
found in the genome are most likely due to the gaps and 
incomplete sequences present in the current version of the 
human genome database. The false positives, which appear 
because two 16-mer tags paired up illegitimately, 
constituted about 2%. 
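
The pairing step of this in silico test can be sketched as follows. It
is a simplified illustration rather than the program used for the
analysis: the genome is a single in-memory string, only exact matches on
one strand are considered, and the function names find_all and
tandem_pairs as well as the toy sequences are hypothetical.

```python
from collections import defaultdict


def find_all(genome: str, tag: str) -> list:
    """Return every 0-based start position of tag in genome (overlaps allowed)."""
    hits, pos = [], genome.find(tag)
    while pos != -1:
        hits.append(pos)
        pos = genome.find(tag, pos + 1)
    return hits


def tandem_pairs(genome: str, tags: list, k: int = 16) -> set:
    """Return the 2k-mers formed by tags that align end to end on the genome."""
    starts = defaultdict(set)  # genome position -> tags that start there
    for tag in set(tags):
        for pos in find_all(genome, tag):
            starts[pos].add(tag)
    pairs = set()
    for pos, firsts in starts.items():
        # A tag starting exactly k bases downstream is a tandem partner.
        for second in starts.get(pos + k, ()):
            for first in firsts:
                pairs.add(first + second)
    return pairs


# Toy genome containing one known 32-mer, plus a pooled "bin" of tags.
tandem = "GCACTTTGGGAGGCCGGCTCACGCCTGTAATC"
genome = "TTTT" * 5 + tandem + "AAAA" * 5
pool = [tandem[16:], "CCCGTCTCTACTAAAA", tandem[:16]]
recovered = tandem_pairs(genome, pool)
```

On a real genome the same test also returns illegitimate pairs whenever
two unrelated tags happen to lie next to each other, which corresponds
to the false positives reported above.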



TABLE 5: In silico validation of the tandem tag method

TEST #    mRNA 32-MER SET        mRNA 16-MER SET        32-MER GENOME   DISTINCT 32-MER   32-MER mRNAS                 GENOME FALSE
                                                        ALIGNMENTS      TANDEMS                                        POSITIVES

1         1000 (995 distinct)    2000 (1988 distinct)   35,874          727               720 (720/995 = 72.4%)        7 (7/727 = 0.96%)
2         1000 (991 distinct)    2000 (1982 distinct)   5,513           746               728 (728/991 = 73.5%)        18 (18/746 = 2.41%)
3         1000 (993 distinct)    2000 (1981 distinct)   154,854         758               752 (752/993 = 75.7%)        6 (6/758 = 0.79%)
4         1000 (992 distinct)    2000 (1981 distinct)   175,420         778               770 (770/992 = 77.6%)        8 (8/778 = 1.03%)
5         1000 (990 distinct)    2000 (1979 distinct)   910             736               729 (729/990 = 73.6%)        7 (7/736 = 0.95%)
6         1000 (992 distinct)    2000 (1984 distinct)   2,642           759               739 (739/992 = 74.5%)        20 (20/759 = 2.64%)
7         1000 (991 distinct)    2000 (1982 distinct)   1,436           735               730 (730/991 = 73.6%)        5 (5/735 = 0.68%)
8         1000 (992 distinct)    2000 (1983 distinct)   184,449         753               742 (742/992 = 74.8%)        11 (11/753 = 1.46%)

AVG 1-8   992 distinct sets      1983 distinct tags     -               749               74.5%                        1.365%

9         3000 (2960 distinct)   6000 (5913 distinct)   177,607         2266              2212 (2212/2960 = 74.7%)     54 (54/2266 = 2.38%)

Tags, once extracted from the sequenced concatemers, are
usually subjected to a clustering protocol to positively
match the tags to known transcripts or to the human genome.
This is done because of the redundant occurrence of some of the
16 base pair tags within the genome, which does not allow
the mining of novel gene transcripts. Since the first set of
tags and their tandem tags are generated from undefined
ends of double-stranded cDNAs, each transcript is highly
likely to generate multiple overlapping or closely spaced
tags. Also, the number of such tags per transcript should
be proportional to the relative abundance of the transcript
in the sample. By aligning all tags against the mRNA database
and/or against the human genome, a stretch of physical
sequence of the corresponding transcript is identified.

An example of a clustering protocol is shown below.
Prior to clustering, 16 bp tags were extracted from
sequenced concatemers and aligned to FASTA files of human
genome, mRNA, and EST sequence databases. The output from
this alignment program yields an alignment table for each
respective sequence database. Each row in the alignment
table is an exact location where one of the tags was found
in the sequence database (GenBank Accession, strand,
sequence position).

Using the genome or mRNA alignment table, tag hits are 
clustered by scanning each sequence (genome contig or mRNA) 
to group tags that are proximal to each other. The 
clustering program accepts two criteria: maximum hit-to-hit 
distance and minimum number of tag hits needed to define a 
cluster. The program picks up the first tag alignment and 
places it into the cluster bin. It continues down the 
genome strand until it finds the next alignment. If its 
distance away from the last alignment placed in the cluster 
bin is less than the maximum hit-to-hit distance then it is 
placed in the cluster bin. Clustering is finished when the 
next hit is too far away or the program finishes scanning 
the genome contig strand. If the number of hits in the
cluster bin is at least the minimum number set by the
user, then a cluster is created and the program outputs to 
a table the cluster location and other relevant 
information. With an mRNA alignment table, the cluster 
program works exactly the same way except that it scans 
down each mRNA instead of a genomic contig. 
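
The scan described above can be sketched in a few lines. This is a
minimal illustration, not the clustering program itself: it handles the
hits of one sequence and one strand at a time, takes hit start positions
as plain integers, and treats a hit whose distance from the previous hit
equals the maximum as still belonging to the cluster, which is an
assumption made so that adjacent tags are joined. The function name
cluster_hits and the example positions are hypothetical.

```python
def cluster_hits(positions, max_gap, min_hits, tag_len=16):
    """Group sorted hit start positions into clusters.

    A hit joins the current cluster when it starts no more than max_gap
    bases after the previous hit. Clusters with fewer than min_hits hits
    are discarded. Each cluster is (begin_pos, end_pos, num_tags), where
    end_pos is the last base covered by the last tag.
    """
    clusters, current = [], []

    def flush():
        # Emit the current bin as a cluster if it is large enough.
        if len(current) >= min_hits:
            clusters.append((current[0], current[-1] + tag_len - 1, len(current)))

    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            flush()
            current = []
        current.append(pos)
    flush()
    return clusters


# Hypothetical hit positions on one mRNA: six close hits and one far away.
hits = [1821, 1823, 1826, 1828, 1829, 1831, 5000]
clusters = cluster_hits(hits, max_gap=16, min_hits=3)
```

Running the function once per contig strand or per mRNA reproduces the
row-per-cluster output described above.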



U.S. PATENT APPLICATION OF SAMAL ET AL. 

To ensure high quality clusters, in this example, a 
maximum hit-to-hit distance of no greater than the tag 
length (hits must be adjacent or overlapping) was used. 
Minimum cluster size was 3 hits. 

TAG CLUSTER EXAMPLES 

1) Clustering against mRNA transcript database 
(Refseq + Genome Annotation mRNAs) 



CLUSTID   GENBGI    BEGINPOS   ENDPOS   NUMTAGS
1         4501858   1821       1846     6

mRNA ID:

>gi|4501858|ref|NM_001609.1| Homo sapiens acyl-Coenzyme A
dehydrogenase, short/branched chain (ACADSB), nuclear gene
encoding mitochondrial protein, mRNA
(2682 bp)

Location of transcript in Genome:

NT_008926.7|17472331 PLUS strand

 64789 - 64929 (1003 - 1143)
 66802 - 66906 (1142 - 1246)
*67437 - 68879 (1243 - 2682)

NT_027097.4 PLUS strand

1770323 - 1770376 (  4 -  57)
1795662 - 1795822 ( 57 - 217)
1799051 - 1799154 (215 - 318)

*matching genome cluster should be:
68015 - 68040 (1821 - 1846).



Clustering against Human Genome database: 



CLUSTID   GENBGI     STRAND   BEGINPOS   ENDPOS   NUMTAGS
3411961   17472331   PLUS     68015      68040    6



This corresponds with expected cluster location and size. 
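
The expected genome location of an mRNA cluster follows from the
alignment blocks listed for the transcript. The sketch below shows the
arithmetic under a simplifying assumption: within a block, mRNA and
genome coordinates are taken to advance one for one, so any small
insertions or deletions inside a block are ignored. The function name
mrna_to_genome and the tuple layout are illustrative; the block
coordinates are the ones listed in the examples of this section.

```python
def mrna_to_genome(pos, blocks, strand="PLUS"):
    """Map an mRNA position to a genome position using alignment blocks.

    Each block is (genome_start, genome_end, mrna_start, mrna_end) as
    listed in the examples. The first block containing pos is used.
    """
    for g_start, _g_end, m_start, m_end in blocks:
        if m_start <= pos <= m_end:
            offset = pos - m_start
            # On the MINUS strand genome coordinates run downward.
            return g_start + offset if strand == "PLUS" else g_start - offset
    raise ValueError(f"mRNA position {pos} is not covered by any block")


# ACADSB blocks on NT_008926.7 (PLUS strand).
acadsb_blocks = [
    (64789, 64929, 1003, 1143),
    (66802, 66906, 1142, 1246),
    (67437, 68879, 1243, 2682),
]
acadsb_cluster = (mrna_to_genome(1821, acadsb_blocks),
                  mrna_to_genome(1846, acadsb_blocks))

# Last AK1 block on NT_029366.3 (MINUS strand).
ak1_blocks = [(1794098, 1792410, 589, 2271)]
ak1_cluster = (mrna_to_genome(1396, ak1_blocks, "MINUS"),
               mrna_to_genome(1364, ak1_blocks, "MINUS"))
```

Both results agree with the "matching genome cluster" lines given for
these two transcripts.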



2) mRNA Cluster(s):

CLUSTID   GENBGI    BEGINPOS   ENDPOS   NUMTAGS
2         4502010   1364       1396     8
3         4502010   1533       1562     7
4         4502010   1587       1623     8

>gi|4502010|ref|NM_000476.1| Homo sapiens adenylate kinase
1 (AK1), mRNA
(2271 bp)

mRNA matches Genome:

NT_029366.3|17449540 MINUS strand

 1803682 - 1803643 (  1 -   40)
 1800671 - 1800631 ( 41 -   81)
 1799083 - 1799043 ( 80 -  120)
 1798874 - 1798709 (117 -  282)
 1797960 - 1797843 (281 -  398)
 1794533 - 1794339 (398 -  592)
*1794098 - 1792410 (589 - 2271)

*matching genome clusters should be:

1793291 - 1793323 (1396 - 1364)
1793125 - 1793150 (1562 - 1533)
1793064 - 1793100 (1623 - 1587)

Genome Cluster(s):

CLUSTID   GENBGI     STRAND   BEGINPOS   ENDPOS    NUMTAGS
1862419   17449540   MINUS    1793062    1793098   8
1862420   17449540   MINUS    1793124    1793153   7
1862422   17449540   MINUS    1793289    1793321   8



3) mRNA Cluster(s):

CLUSTID   GENBGI    BEGINPOS   ENDPOS   NUMTAGS
5         4502042   1927       1959     9
6         4502042   2010       2047     6
7         4502042   2058       2131     12

>gi|4502042|ref|NM_000694.1| Homo sapiens aldehyde
dehydrogenase 3 family, member B1 (ALDH3B1), mRNA
(2790 bp)

mRNA matches Genome:

NT_008940.7|17472907 PLUS strand

 1472982 - 1473028 (   1 -   47)
 1477929 - 1478094 (  44 -  209)
 1481160 - 1481272 ( 208 -  321)
 1481406 - 1481528 ( 320 -  442)
 1481798 - 1481889 ( 436 -  527)
 1482346 - 1482431 ( 525 -  610)
 1484116 - 1484504 ( 607 -  996)
 1485227 - 1485398 ( 996 - 1167)
 1488638 - 1488743 (1160 - 1265)
*1490381 - 1491906 (1263 - 2790)

*matching genome cluster(s) should be:

1491045 - 1491077 (1927 - 1959)
1491128 - 1491165 (2010 - 2047)
1491176 - 1491249 (2058 - 2131)

Genome Cluster(s):

CLUSTID   GENBGI     STRAND   BEGINPOS   ENDPOS    NUMTAGS
3473301   17472907   PLUS     1491044    1491076   9
3473302   17472907   PLUS     1491127    1491164   6
3473303   17472907   PLUS     1491175    1491248   12



4) mRNA Cluster(s):

CLUSTID   GENBGI     BEGINPOS   ENDPOS   NUMTAGS
7012      14786455   2347       2385     12

>gi|14786455|ref|XM_009672.4| Homo sapiens
phosphoenolpyruvate carboxykinase 1 (soluble) (PCK1), mRNA
(2642 letters)

mRNA matches Genome:

NT_011362.7|17484369 PLUS strand

21189036 - 21189118 (   1 -   83)
21189283 - 21189548 (  80 -  345)
21189983 - 21190167 ( 345 -  529)
21190607 - 21190812 ( 526 -  731)
21190941 - 21191128 ( 732 -  919)
21191447 - 21191642 ( 919 - 1084)
21192080 - 21192307 (1081 - 1308)
21192394 - 21192529 (1307 - 1442)
21192952 - 21193049 (1439 - 1536)
21193261 - 21194369 (1534 - 2642)

*matching genome cluster(s) should be:
21194074 - 21194112 (2347 - 2385)



Genome Cluster(s):

CLUSTID   GENBGI     STRAND   BEGINPOS   ENDPOS     NUMTAGS
4332399   17484369   PLUS     21194074   21194112   12

5) mRNA Cluster(s):

CLUSTID   GENBGI    BEGINPOS   ENDPOS   NUMTAGS
647       5174710   1385       1410     10
648       5174710   1446       1484

>gi|5174710|ref|NM_005992.1| Homo sapiens T-box 1 (TBX1),
transcript variant B, mRNA
(1538 bp)

mRNA matches Genome:

NT_011519.9|17484914 PLUS strand

 2892106 - 2892148 (   1 -   43)
 2894958 - 2895080 (  41 -  163)
 2896306 - 2896684 ( 162 -  540)
 2898641 - 2898747 ( 537 -  643)
 2899557 - 2899729 ( 641 -  813)
 2900361 - 2900516 ( 814 -  969)
 2901160 - 2901229 ( 969 - 1038)
 2901304 - 2901406 (1037 - 1139)
 2918314 - 2918438 (1137 - 1261)
*2918714 - 2918996 (1256 - 1538)

*matching genome cluster(s) should be:

2918843 - 2918868 (1385 - 1410)
2918904 - 2918942 (1446 - 1484)



Genome Cluster(s):

CLUSTID   GENBGI     STRAND   BEGINPOS   ENDPOS    NUMTAGS
4343636   17484914   PLUS     2918843    2918868   5
4343637   17484914   PLUS     2918904    2918942   10



Occasionally, alignment of two tandem 16-mer tags on
the human genome produced false 32-mer sequences that
probably do not exist in real transcripts. These represent
false pairing against the human genome and are
false positives. Such false pairing can be reduced by using a
second 5' adaptor containing two degenerate nucleotide
bases. This example is shown below:

Bpm I digestion

5' ... CTGGAG(N)16^ ... 3'
3' ... GACCTC(N)14^ ... 5' (SEQ ID NO: 59)

The first adaptor:

GCAGTGGTATCAACGCAGAGTCCACGCGTCTGGAG
   ||||||||||||||||||||||||||||||||
   CACCATAGTTGCGTCTCAGGTGCGCAGACCTCp (SEQ ID NO: 3)

The second adaptor with 2 NN on the 3' end of the
first strand:

GCAGTGGTATCAACGCAGAGTCCACGCGTCTGGAGNN
   ||||||||||||||||||||||||||||||||
   CACCATAGTTGCGTCTCAGGTGCGCAGACCTCp (SEQ ID NO: 60)

Bpm I digestion leaves a 3' overhang of two nucleotides
on the bottom strand of the leftover cDNA, to which the
second adaptor, with its two-nucleotide NN 3' overhang on
the top strand, is ligated. These two nucleotides are
conserved in the second tag after the second Bpm I cut.
Hence the last two nucleotides of the first tag and the
first two nucleotides of the 'putative' tandem tag are the
same. This prevents the random matching of all the
available tags to the first tag and significantly decreases
the artificial combination of two random 16-mers.
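
The two-nucleotide check can be expressed as a simple filter. The sketch
below is illustrative only: the function name pair_with_overlap and the
example second-round tags are hypothetical, and it assumes that the
shared nucleotides sit at the 3' end of the first tag and the 5' end of
the second tag, as described above.

```python
def pair_with_overlap(first_tags, second_tags, overlap=2):
    """Pair tags whose ends share `overlap` bases.

    Returns (first, second, merged) tuples, where merged is the combined
    sequence with the shared bases counted once.
    """
    pairs = []
    for first in first_tags:
        for second in second_tags:
            if first[-overlap:] == second[:overlap]:
                pairs.append((first, second, first + second[overlap:]))
    return pairs


first_round = ["GCACTTTGGGAGGCCG"]
# Hypothetical second-round tags: only the first begins with "CG".
second_round = ["CGGCTCACGCCTGTAA", "GCTCACGCCTGTAATC"]
pairs = pair_with_overlap(first_round, second_round)
```

A random second tag passes this filter by chance about one time in
sixteen, so the check narrows the candidate partners rather than
removing false pairs entirely, and with a two-base overlap two 16-mers
span 30 bases instead of 32.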

TABLE 6 below lists other type II restriction enzymes
that generate short DNA fragments away from the recognition
sites and could be used in this method.

TABLE 6: Type II restriction enzymes with asymmetric
recognition sequences

CUT POSITION      TYPE II RESTRICTION ENZYMES
Cuts after 4n     Ear I, Sap I
Cuts after 5n     Alw I, Bmr I, Bsa I, BsmA I, BsmB I, Mly I, Ple I
Cuts after 6n     Bbs I, BciV I, Fau I
Cuts after 7n     Mnl I
Cuts after 8n     Aar I, BfuA I, BspM I, Hph I, Mbo II, SspD5 I, Sth132 I
Cuts after 9n     SfaN I
Cuts after 10n    BseR I, BspCN I, Hga I
Cuts after 11n    AceIII, Eci I, TaqII, Tth111II
Cuts after 12n    Bbv I, RleAI
Cuts after 13n    BcefI, Fok I
Cuts after 14n    BceA I, BsmF I, StsI
Cuts after 16n    Bce83I, Bpm I, Bsg I, Eco57I, Eco57MI
Cuts after 20n    MmeI
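
The cut distances in TABLE 6 indicate how many rounds of tag release
would be needed with a given enzyme to reach a combined length of at
least 20 bases. The sketch below covers a few enzymes from the table and
rests on a simplifying assumption: each round is taken to contribute
exactly the listed cut distance, ignoring overhangs. The dictionary
CUT_DISTANCE and the function rounds_needed are hypothetical names.

```python
import math

# Cut distance (bases after the recognition site) for selected enzymes,
# copied from TABLE 6.
CUT_DISTANCE = {
    "Ear I": 4,
    "Mnl I": 7,
    "Fok I": 13,
    "Bpm I": 16,
    "MmeI": 20,
}


def rounds_needed(enzyme: str, target_len: int = 20) -> int:
    """Number of successive tag-release rounds to cover target_len bases."""
    return math.ceil(target_len / CUT_DISTANCE[enzyme])


rounds = {name: rounds_needed(name) for name in CUT_DISTANCE}
```

For Bpm I this gives the two rounds used in the example above, which
together yield 32 bases.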

While the invention has been described in connection
with what is presently considered to be the most practical
and preferred embodiments, it is to be understood that the
invention is not limited to the disclosed embodiments, but
on the contrary is intended to cover various modifications
and equivalent arrangements included within the spirit and
scope of the appended claims.

Thus, it is to be understood that variations in the
present invention can be made without departing from the
novel aspects of this invention as defined in the claims.
All patents and articles cited herein are hereby
incorporated by reference in their entirety and relied
upon.



